Round 1: Technical Round - Screening
✅ Tell me about yourself and any recent projects you have been a part of.
✅ Questions related to your projects.
✅ How would you migrate data from an on-premises SQL database and stream data to Azure?
✅ Define a pipeline creation strategy for condition-based scenarios (2-3 conditions were asked).
✅ Spark - Define Partitioning strategy (Physical & Logical)
✅ PySpark optimisation techniques, and any instance where you applied them in a real project.
✅ Cluster formation and resource calculation based on the provided data; how would you scale if the data volume increases or decreases?
✅ How do you implement Logic Apps in ADF?
✅ What types of transformations have you performed in your projects?
✅ How do you choose between groupByKey and reduceByKey?
✅ What are the SCD types, and where and how can you implement them?
✅ What are the differences between Delta table, Data Lake and Warehouse?
✅ How do you read data from ADLS using SQL Pool in Synapse?
✅ What are managed tables, external tables and materialised views, and how are they used?
✅ Coding questions in PySpark (broadcast joins and left anti joins) and SQL (CTEs and analytic functions).
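For the SQL part of the last question, a minimal sketch of a CTE combined with an analytic function might look like the following. It runs against an in-memory SQLite database from the Python standard library so it is self-contained; the `sales` table, its columns and the sample rows are hypothetical and were not part of the interview. It assumes the bundled SQLite is version 3.25 or newer, which is when window functions were added.

```python
import sqlite3

# Hypothetical sample data: one row per sales rep.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (region TEXT, rep TEXT, amount INTEGER);
INSERT INTO sales VALUES
  ('east', 'ann', 100), ('east', 'bob', 300),
  ('west', 'cy', 200), ('west', 'di', 50);
""")

# The CTE ranks reps inside each region with an analytic function;
# the outer query keeps only the top-ranked rep per region.
query = """
WITH ranked AS (
    SELECT region, rep, amount,
           RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rnk
    FROM sales
)
SELECT region, rep, amount
FROM ranked
WHERE rnk = 1
ORDER BY region;
"""
top_per_region = conn.execute(query).fetchall()
print(top_per_region)
```

The same pattern (rank inside a partition, then filter in an outer query) carries over to Spark SQL or Synapse, although the exact dialect details can differ.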
Round 2: Technical Round - Architecture & Coding
✅ Assigned a workspace and a detailed use case to prepare a PowerPoint presentation on the high-level architecture of the pipeline workflow.
✅ For that pipeline, define the cloud platform and the relevant services for each use case.
✅ For the same use case, some queries had to be implemented in Python using pandas or PySpark.
Round 3: Technical Round - In Depth
✅ Architecture framework in Big Data (when to use what, pros/cons of various techs, arch. principles/guidelines)
✅ Architecture of batch processing
✅ Lambda architecture vs Kappa architecture vs Medallion architecture
✅ What is DAG Scheduler? How does it work in Spark?
✅ Difference between a list and a tuple in Python
✅ Write Python code to convert the string a3bc2gh into abbbbcgggh
✅ What is a window function, and what are the different types of window functions?
✅ Difference between Union & Union All? Which one is faster?
✅ What is CAP Theorem?
✅ Difference between Cassandra & MongoDB in terms of technical specifications.
✅ Difference between the Parquet, Avro & ORC file formats? Why doesn't Hive support Parquet?
✅ Have you used any analytical functions?
✅ Techniques for Query Optimization
✅ What is Catalyst Optimizer?
✅ What is the difference between ELT and ETL? What are the advantages & disadvantages of both? What are the challenges?
✅ Difference between git merge and git rebase.
✅ Difference between MongoDB & HBase.
✅ Various questions on Spark SQL, API queries, and the architecture of Spark & Databricks.
✅ Awareness of the latest trends in Data Engineering & Data Mesh
✅ Difference between sensors and operators in Airflow, and how to connect Airflow to Azure (ADLS and Functions)
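For the string-conversion question above, here is one possible sketch. The rule is inferred from the single example given (a3bc2gh becomes abbbbcgggh): each digit is replaced by that many copies of the character that follows it, and that character is then kept as well. The function name `expand` is my own, and the code assumes single-digit counts, each followed by a non-digit character.

```python
def expand(s: str) -> str:
    """Replace each digit with that many copies of the following character."""
    out = []
    for i, ch in enumerate(s):
        if ch.isdecimal():
            # Assumption: counts are single digits; a trailing digit
            # with nothing after it is simply dropped.
            if i + 1 < len(s):
                out.append(s[i + 1] * int(ch))
        else:
            # Ordinary characters are copied through unchanged.
            out.append(ch)
    return "".join(out)


print(expand("a3bc2gh"))  # abbbbcgggh
```

Building the result in a list and joining once at the end avoids repeated string concatenation, which is a detail an interviewer may ask about.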
Round 4: Techno Managerial
✅ Tell me about yourself, the recent project you were part of, and your roles and responsibilities.
✅ How do you create a Delta table in Databricks, and how would you manage ACID guarantees within it?
✅ Core principles of the company.
✅ Why do you want to join us, and why should we select you for this post?
✅ How will you manage data governance and country-specific data protection acts such as HIPAA?
✅ General discussion about problems faced in the project during development and production rollout, and how they were fixed.
✅ How did you manage team members, customers, business analysts and other stakeholders?
✅ How did you manage project details and versions of code & documents? (read up on Jira and Git)
Round 5: HR
✅ Discussion around my experience and projects, some resume-based questions
✅ What are you expecting in your next job role & compensation?
✅ How soon can you join the company, and what is your preferred location?